Method and apparatus for inverse tone mapping

ABSTRACT

Inverse tone mapping (ITM) aims at generating a single high dynamic range (HDR) image from a low dynamic range (LDR) image. While ITM was frequently used for graphics rendering in the HDR space, the advent of HDR consumer displays (e.g., HDR TV) and the consequent need for HDR multimedia contents open up new horizons for the consumption of ultra-high quality video contents. However, due to the lack of HDR-filmed contents, the legacy LDR videos must be up-converted for viewing on these HDR displays. Unfortunately, the previous ITM methods are not appropriate for HDR consumer displays, and their inverse-tone-mapped results are not visually pleasing with noise amplification or lack of details. In this paper, we propose a convolutional neural network (CNN) based architecture designed for the ITM to HDR consumer displays, called ITM-CNN, and its training strategy for enhancing the performance based on image decomposition using the guided filter. We demonstrate the benefits of decomposing the image by experimenting with various architectures and also compare the performance for different training strategies. To the best of our knowledge, this paper first presents the ITM problem using CNNs for HDR consumer displays, where the network is trained to restore lost details and local contrast. Our ITM-CNN can readily up-convert LDR images for direct viewing on an HDR consumer medium, and is a very powerful means to solve the lack of HDR video contents with legacy LDR videos.

TECHNICAL FIELD

At least one example embodiment relates to a method for inverse tone mapping and to apparatuses performing the method.

BACKGROUND ART

The human visual system perceives the world as much brighter, with stronger contrasts and more details than is typically presented on standard dynamic range (SDR) displays. In comparison, recently available high dynamic range (HDR) consumer displays allow users to enjoy videos closer to reality as seen by the naked eye, with the brightness of at least 1,000 cd/m² (as opposed to 100 cd/m² for SDR displays), higher contrast ratio, increased bit depth of 10 bits or more, and wide color gamut (WCG). However, although HDR TVs are readily available in the market, there is a severe lack of HDR contents.

Inverse tone mapping (ITM), also referred to as reverse tone mapping, is a popular area of research in computer graphics that aims to predict HDR images from low dynamic range (LDR) images for better graphics rendering. Another field of research, HDR imaging, makes use of multiple LDR images of different exposures to create a single HDR image that contains details in the saturated regions. In the above two fields of research, the lighting calculations are conducted in the HDR domain with the belief that this would yield a more accurate representation of the graphic or natural scene on an SDR display. The HDR images are viewed on professional HDR monitors during rendering. Consequently, the HDR domain referred to in the above areas is not necessarily the same as that of the now available HDR consumer displays, and the HDR images produced by the ITM methods for such purposes are not suitable for direct viewing on an HDR TV. When the conventional ITM methods are applied for an HDR TV with the maximum brightness of 1,000 cd/m², they are not capable of fully utilizing the available HDR capacity due to their weakness in generating full contrast and/or details, or due to noise amplification, as seen in FIG. 1. (Note that the images are tone mapped for viewing on paper.)

DISCLOSURE OF INVENTION

Technical Problem

When the conventional ITM methods are applied for an HDR TV with the maximum brightness of 1,000 cd/m², they are not capable of fully utilizing the available HDR capacity due to their weakness in generating full contrast and/or details, or due to noise amplification.

Solution to Problem

Therefore, we formulate a slightly different problem where we aim to generate HDR images that can be directly viewed on commercial HDR TVs. In this way, LDR legacy videos may be up-converted to be viewed on available HDR displays without additional information required. We propose an effective convolutional neural network (CNN) based structure and its learning strategy for up-converting a single LDR image of 8 bits/pixel, gamma-corrected [20], in the BT.709 color container [25], to an HDR image of 10 bits/pixel through the perceptual quantization (PQ) transfer function [27] in the BT.2020 color container [26], that may be directly viewed with commercial HDR TVs.

BRIEF DESCRIPTION OF DRAWINGS

These and/or other aspects will become apparent and more readily appreciated from the following description of example embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a diagram illustrating conventional ITM methods;

FIG. 2 is a diagram illustrating the architecture of the ITM-CNN;

FIG. 3 illustrates image decomposition using a guided filter;

FIG. 4 is a diagram illustrating the pre-train structure;

FIG. 5a, FIG. 5b and FIG. 5c are diagrams illustrating different architectures using image decomposition; and

FIG. 6 is a diagram illustrating comparisons with previous methods.

MODE FOR THE INVENTION

Hereinafter, some example embodiments will be described in detail with reference to the accompanying drawings. Regarding the reference numerals assigned to the elements in the drawings, it should be noted that the same elements will be designated by the same reference numerals, wherever possible, even though they are shown in different drawings. Also, in the description of embodiments, detailed description of well-known related structures or functions will be omitted when it is deemed that such description will cause ambiguous interpretation of the present disclosure.

It should be understood, however, that there is no intent to limit this disclosure to the particular example embodiments disclosed. On the contrary, example embodiments are to cover all modifications, equivalents, and alternatives falling within the scope of the example embodiments.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

In addition, terms such as first, second, A, B, (a), (b), and the like may be used herein to describe components. Each of these terminologies is not used to define an essence, order or sequence of a corresponding component but used merely to distinguish the corresponding component from other component(s).

It should be noted that if it is described in the disclosure that one component is “connected,” “coupled,” or “joined” to another component, a third component may be “connected,” “coupled,” and “joined” between the first and second components, although the first component may be directly connected, coupled or joined to the second component.

It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.

Unless otherwise defined, all terms including technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which these example embodiments belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Various example embodiments will now be described more fully with reference to the accompanying drawings in which some example embodiments are shown. In the drawings, the thicknesses of layers and regions are exaggerated for clarity.

We propose the first CNN architecture, called ITM-CNN, for the ITM problem to readily available HDR consumer displays.

Our architecture is a fully end-to-end CNN that is able to jointly optimize the LDR decomposition and HDR reconstruction phases.

This allows all legacy LDR images/video to be up-converted for direct viewing on HDR TVs.

2. RELATED WORK

Tone mapping is a popular problem dealt with in computer graphics and image processing. When graphics are rendered in the HDR domain for better representation of the scenes which have near-continuous brightness and high contrast in the real world, they have to be tone mapped somehow to the displayable range. ITM came later, to transfer the LDR domain images to the HDR domain. The term itself was first used by Banterle et al. in [14]. The previous ITM methods mainly focus on generating the expand map and revealing contrast in saturated regions. In this section, previous methods regarding ITM to the HDR domain will be reviewed. It should be noted that we address a slightly different problem in this paper than the previous methods, where our final goal is not viewing HDR rendered tone mapped images on SDR displays, nor viewing HDR images on a professional HDR monitor, but viewing generated HDR images directly on consumer HDR displays (e.g., HDR TVs).

ITM was first introduced by [14, 17], where Banterle et al. formulated the inverse of Reinhard's tone mapping operator [18]. In addition to the inverse of the tone mapping operator, they also proposed an expand map, which specifies the amount of expansion for every pixel position. The two main purposes of the expand map are to reduce contouring artifacts resulting from quantization and to expand the bright regions in the LDR images. Similarly, Rempel et al. proposed a brightness enhancement function in [22]. The brightness enhancement function is derived from the blurred mask that indicates saturated pixel areas. With the edge stopping filter, the brightness enhancement function preserves edges with high contrast. Another related approach is [15, 16], where Kovaleski and Oliveira used a bilateral grid to make an edge-preserving expand map.

There are also methods that find a global mapping function of the whole image instead of a pixel-wise mapping. In [21], Meylan et al. applied linear expansions with two different slopes depending on whether the pixel is classified as the diffuse region or the specular region. Pixels with values greater than a predefined threshold are classified into the specular region and all other pixels are classified into the diffuse region. The specular region is expanded with a steeper function than the diffuse region. The ITM algorithms presented above mainly classify pixels into two classes: pixels to be expanded more and pixels to be expanded less. Usually, the pixels in the bright regions tend to be expanded more.

Another method that investigates a global mapping function is [23]. In [23], Masia et al. evaluated a number of ITM algorithms and found that the performance of the algorithms decreased for overexposed input images. Based on this observation, they proposed an ITM curve based on the gamma curve, where the parameter gamma is a function of the statistics of the input image. Their following work [24] improves upon their previous work with a robust multilinear regression model. In [19], Huo et al. proposed an ITM algorithm imitating the characteristics of the human visual system and its retina response. This ITM algorithm enhances local contrast.

Recently, CNN-based structures have shown exceptional performance in modelling images to find a non-linear mapping, especially for classification problems [11] or regression problems [12, 13]. There are very few ITM methods based on CNNs. One of them, proposed by Endo et al. [7], is an indirect approach where they use a combination of 2D and 3D CNNs to generate multiple LDR images (bracketed images) of different exposures from a single LDR image, and merge these bracketed images using the existing methods to obtain a final tone mapped HDR image. Kalantari et al. [8] proposed an HDR imaging method where they used a CNN for integrating the given multiple LDR images of different exposures to generate an HDR image which is tone mapped for viewing. Eilertsen et al. [9] and Zhang et al. [10] proposed encoder-decoder-based networks for HDR reconstruction which is also tone mapped for viewing.

However, the previous methods share an ultimate step of tone mapping the HDR-rendered images, where the HDR domain referred to in their papers is that to be viewed on professional HDR monitors for rendering operations. Consequently, they have not considered transfer functions such as PQ-EOTF [27, 28] or Hybrid Log Gamma [28] and color containers related to the SDR or HDR format. Even when the transfer function and the color container are converted manually, HDR images converted by the previous methods are not suitable for viewing on consumer HDR displays. In this paper, we propose an ITM method with ITM-CNN, by which the resulting HDR images can be directly viewed on commercial HDR TVs. The end-to-end CNN-based structure of our ITM-CNN benefits from image decomposition along with delicate training strategies.

3. PROPOSED METHOD

Our proposed ITM-CNN has an end-to-end CNN structure for the prediction of the HDR image from a single LDR image.

3.1 Network Architecture

In tone mapping, edge-preserving filters (e.g., the bilateral filter) are frequently used on the HDR input to decompose the image into the base layer and detail layer so that only the base layer is compressed while preserving the detail layer. The processed base and detail layers are then integrated to obtain a final LDR output image. In contrast, the purpose of an ITM algorithm is to predict lost details with an extended base layer to match the desired brightness to finally generate the output HDR image. If the image is decomposed into two parts with different characteristics, appropriate processing may be done for the individual branches for a more accurate prediction of the output image. Following from this idea, we explicitly model our CNN structure (ITM-CNN) as three parts: (i) LDR decomposition, (ii) feature extraction and (iii) HDR reconstruction, as illustrated in FIG. 2.

The first part of our ITM-CNN, LDR decomposition, consists of three convolution layers where the number of output channels for the last layer is six, intended to decompose each LDR input image into two different sets of feature maps, simply divided as the first three and the last three feature maps. Then, the convolution layers in the feature extraction part proceed individually for the two sets so that each of the two CNN branches is able to focus on the characteristics of the respective inputs for specialized feature extraction. Lastly, for HDR reconstruction, the extracted feature maps from the individual passes are concatenated and the network (HDR reconstruction part) learns to integrate the feature maps from the two passes to finally generate an HDR image. Our ITM-CNN jointly optimizes all three steps of LDR decomposition, feature extraction and HDR reconstruction, but has to be trained delicately to fully benefit from the image filtering idea (decomposition of the LDR input).
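The data flow through the three parts can be sketched in NumPy as follows. This is only an illustrative forward pass with random weights, not the MatConvNet implementation: the channel widths are read from structure (c) in Table 3 and from the description of the decomposition layers in Section 4.3, while the names `conv3x3`, `xavier`, `init_itm_cnn`, `run` and `itm_cnn_forward`, the zero padding, and the choice to omit the ReLU on the last layer of the decomposition and reconstruction parts are assumptions made for the example.

```python
import numpy as np

def conv3x3(x, w, relu=True):
    """Zero-padded 3x3 convolution. x: (H, W, Cin), w: (3, 3, Cin, Cout)."""
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            out += xp[dy:dy + h, dx:dx + wd, :] @ w[dy, dx]
    return np.maximum(out, 0.0) if relu else out

def xavier(rng, cin, cout):
    """Normal Xavier initialization using both fan-in and fan-out."""
    std = np.sqrt(2.0 / (9 * cin + 9 * cout))
    return rng.normal(0.0, std, size=(3, 3, cin, cout))

def init_itm_cnn(seed=0):
    rng = np.random.default_rng(seed)

    def chain(widths):
        return [xavier(rng, a, b) for a, b in zip(widths, widths[1:])]

    return {
        "decomp": chain([3, 32, 32, 6]),      # LDR decomposition, six output maps
        "branch_a": chain([3, 32, 32, 32]),   # feature extraction, first three maps
        "branch_b": chain([3, 32, 32, 32]),   # feature extraction, last three maps
        "recon": chain([64, 40, 40, 3]),      # HDR reconstruction after concatenation
    }

def run(x, layers, last_relu=True):
    for i, w in enumerate(layers):
        x = conv3x3(x, w, relu=last_relu or i < len(layers) - 1)
    return x

def itm_cnn_forward(ldr, p):
    feats = run(ldr, p["decomp"], last_relu=False)
    a = run(feats[..., :3], p["branch_a"])
    b = run(feats[..., 3:], p["branch_b"])
    return run(np.concatenate([a, b], axis=-1), p["recon"], last_relu=False)

params = init_itm_cnn()
patch = np.random.default_rng(1).random((40, 40, 3))   # stand-in for a YUV LDR patch
print(itm_cnn_forward(patch, params).shape)
```

The split `feats[..., :3]` and `feats[..., 3:]` mirrors the "first three and last three feature maps" wording above; everything else about the layers is a plain stack of 3×3 convolutions.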

3.2 Training Strategy

First, we pre-train the feature extraction and HDR reconstruction parts of the ITM-CNN after setting the LDR decomposition part as a guided-filtering-based separation of the base and detail layers for the LDR input. The pre-training structure of the ITM-CNN is illustrated in FIG. 4. The guided filter [1, 2] is an edge-preserving filter that, unlike the bilateral filter [3], does not suffer from gradient reversal artifacts. The base layer is extracted using the self-guided filter as suggested in [1, 2], and the detail layer is obtained by element-wise division of the input LDR image by the base layer, given as

$I_{detail} = I_{LDR} \oslash I_{base}$  (1)

where $I_{LDR}$ is the LDR input, $I_{detail}$ is the detail layer and $I_{base}$ is the base layer. $\oslash$ in (1) denotes an element-wise division operator. An example of an image separated using the guided filter is given in FIG. 3.
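A minimal NumPy sketch of this separation is given below, assuming the standard guided filter formulation with the image used as its own guide. The window radius `r`, the regularization `eps`, the small constant `delta` that guards the division, and the edge-replicating box filter are illustrative choices that are not specified in this description.

```python
import numpy as np

def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window, replicating edge pixels at the borders."""
    h, w = x.shape
    xp = np.pad(x, r, mode="edge")
    acc = np.zeros((h, w), dtype=np.float64)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            acc += xp[dy:dy + h, dx:dx + w]
    return acc / (2 * r + 1) ** 2

def self_guided_filter(img, r=5, eps=1e-2):
    """Guided filter of a single channel with the channel itself as the guide."""
    mean = box_mean(img, r)
    var = box_mean(img * img, r) - mean * mean
    a = var / (var + eps)        # near 1 around strong edges, near 0 in flat areas
    b = (1.0 - a) * mean
    return box_mean(a, r) * img + box_mean(b, r)

def decompose(ldr, r=5, eps=1e-2, delta=1e-6):
    """Split an H x W x C image in [0, 1] into base and detail layers as in Eq. (1)."""
    ldr = ldr.astype(np.float64)
    base = np.stack([self_guided_filter(ldr[..., c], r, eps)
                     for c in range(ldr.shape[-1])], axis=-1)
    detail = ldr / (base + delta)   # element-wise division; delta avoids 0/0
    return base, detail
```

Because the separation is multiplicative, multiplying the two layers back together recovers the input (up to `delta`), which is what allows a later stage to recombine processed base and detail predictions.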

After pre-training the feature extraction and HDR reconstruction parts using the pre-train structure with guided-filtering-separation, the guided filter is replaced with three convolution layers (LDR decomposition part in FIG. 2), allowing for the final fully convolutional architecture as given in FIG. 2. We pre-train the three layers in the new LDR decomposition part with the same data but without updating the weights of later layers by setting the learning rate to zero for those layers, so that the convolution layers learn to decompose the LDR input image into feature maps that lower the final loss, while utilizing the weights in later layers that were trained with guided filter separation. When the pre-training is finished, the network (ITM-CNN) is finally trained end-to-end for joint optimization of all three parts. We observed a significant increase in performance by using this pre-training strategy.
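The three-stage schedule can be written down as a small configuration sketch. The stage and part names below are hypothetical labels chosen for the example, and the base learning rate is the filter learning rate reported in the experiment conditions; the only mechanism taken from the text is that a learning rate of zero keeps a part fixed.

```python
BASE_LR = 1e-4  # filter learning rate reported in the experiment conditions

PARTS = ("ldr_decomposition", "feature_extraction", "hdr_reconstruction")

# (stage name, decomposition in use, parts whose weights are updated)
SCHEDULE = [
    ("pretrain_later_layers", "guided_filter",
     {"feature_extraction", "hdr_reconstruction"}),
    ("pretrain_decomposition", "conv_layers", {"ldr_decomposition"}),
    ("end_to_end", "conv_layers", set(PARTS)),
]

def learning_rates(trainable):
    """Per-part learning rate; a rate of zero keeps that part's weights fixed."""
    return {part: (BASE_LR if part in trainable else 0.0) for part in PARTS}

for name, decomposition, trainable in SCHEDULE:
    print(name, decomposition, learning_rates(trainable))
```

In the first stage the decomposition is the guided filter, which has no trainable weights, so its zero learning rate is only a placeholder.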

4. EXPERIMENTS

4.1 Experiment Conditions

Data. We collected 7,268 frames of 3840×2160 UHD resolution of the LDR-HDR data pairs containing diverse scenes. The specifications are given in Table 1. The HDR video is professionally filmed and mastered, and both the LDR and HDR data are normalized to be in the range [0, 1]. For the synthesis of training data, we randomly cropped 20 subimages of size 40×40 per frame with the frame stride of 30. This resulted in the total training data of size 40×40×3×4,860. For testing, we selected 14 frames from six different scenes that are not included in the training set. All videos were converted to the YUV color space and all three YUV channels were used for training. Although it is common to use the Y channel only, using all three channels is more reasonable for our ITM problem since the color container also changes from BT.709 [25] to BT.2020 [26]. The quantitative benefit of using all three channels is shown in Table 2 when experimenting with a simple CNN structure.
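The patch count follows from the sampling numbers above: taking every 30th of the 7,268 frames gives 243 frames, and 20 crops per frame gives 4,860 patches. The sketch below reproduces that arithmetic and shows one way to cut aligned LDR/HDR crops; the function names are hypothetical, and starting the sampling at frame 0 and using identical crop positions for both images are assumptions.

```python
import numpy as np

def sampled_frame_indices(num_frames=7268, stride=30):
    """Indices of the frames used for training, assuming sampling starts at frame 0."""
    return range(0, num_frames, stride)

def random_paired_crops(ldr, hdr, rng, crops_per_frame=20, size=40):
    """Cut the same random windows from an aligned LDR/HDR frame pair."""
    h, w = ldr.shape[:2]
    pairs = []
    for _ in range(crops_per_frame):
        y = int(rng.integers(0, h - size + 1))
        x = int(rng.integers(0, w - size + 1))
        pairs.append((ldr[y:y + size, x:x + size], hdr[y:y + size, x:x + size]))
    return pairs

num_patches = len(sampled_frame_indices()) * 20
print(num_patches)  # 243 frames x 20 crops = 4860
```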

TABLE 1
Data Type    Bit Depth        Transfer Function    Color Container
LDR          8 bits/pixel     Gamma                BT.709
HDR          10 bits/pixel    PQ-EOTF              BT.2020

TABLE 2
Train Data          Y only    YUV
PSNR of Y (dB)      44.36     46.36
PSNR of YUV (dB)    32.53     48.25

The large difference in the PSNR when measured for all three YUV channels is largely due to the color container and transfer function mismatch of LDR and HDR images if the U and V channels are not trained. Note that the Y channel benefits from the complementary information of the U and V channels.
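For reference, the two PSNR figures can be computed as in the following sketch. It assumes signals normalized to [0, 1] (so the peak value is 1) and that the YUV figure pools the squared error over all three channels; the exact pooling used for the reported numbers is not stated here, and the function names are hypothetical.

```python
import numpy as np

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB for signals normalized to [0, peak]."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def psnr_y_and_yuv(pred_yuv, target_yuv):
    """PSNR of the Y channel alone and of all three YUV channels together."""
    return psnr(pred_yuv[..., 0], target_yuv[..., 0]), psnr(pred_yuv, target_yuv)
```

With this definition, an error confined to the U and V channels leaves the Y figure untouched while lowering the YUV figure, which is the pattern seen in the "Y only" column of Table 2.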

Training parameters. The weight decay of the convolution filters was set to $5\times10^{-4}$, with that of the biases set to zero. The mini-batch size was 32, with a learning rate of $10^{-4}$ for filters and $10^{-5}$ for biases. All convolution filters are of size 3×3 and the weights were initialized with the Xavier initialization [4] that draws the weights from a normal distribution with the variance expressed with both the number of input and output neurons. The loss function of the network (ITM-CNN) is given by

$L(\theta) = \frac{1}{2n}\sum_{i=1}^{n} \left\| F(I_{LDR};\theta) - I_{HDR} \right\|^{2}$  (2)

where θ is the set of model parameters, n is the number of training samples, $I_{LDR}$ is the input LDR image, F is the non-linear mapping function of the ITM-CNN giving the prediction of the network as $F(I_{LDR};\theta)$, and $I_{HDR}$ is the ground truth HDR image. The activation function is the rectified linear unit (ReLU) [5] given by

$\mathrm{ReLU}(x) = \max(0, x)$  (3)

All network models are implemented with the MatConvNet [6] package.
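For readers without MatConvNet, the loss in (2) and the activation in (3) can be written in a few lines of NumPy. This is an illustrative sketch rather than the code used for the experiments; the batch layout (n, H, W, C) and the function names are assumptions.

```python
import numpy as np

def relu(x):
    """Eq. (3): element-wise max(0, x)."""
    return np.maximum(0.0, x)

def itm_loss(pred, target):
    """Eq. (2): sum of squared errors over the batch, divided by 2n.

    pred and target have shape (n, H, W, C); pred stands for F(I_LDR; theta).
    """
    n = pred.shape[0]
    return float(np.sum((pred - target) ** 2) / (2.0 * n))
```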

4.2 Input Decomposition

Decomposing the LDR input lets the feature extraction layers concentrate on each of the decompositions. Specifically, we use the guided filter [1, 2] for input decomposition. We compare three different architectures shown in FIG. 5 to observe the effect of decomposing the LDR input.

The first structure shown in FIG. 5a is a simple six-convolution-layer structure with residual learning. Since both the input LDR image with 8 bits/pixel and the ground truth HDR image with 10 bits/pixel are normalized to be in the range [0, 1], we can simply model the network (FIG. 5a) to learn the difference between the LDR and HDR image for a more accurate prediction. Although no decomposition is performed on the input LDR image, residual learning may be interpreted as an additive separation of the output that allows the CNN (FIG. 5a) to focus only on predicting the difference between LDR and HDR.

The second structure shown in FIG. 5b uses the guided filter for multiplicative input decomposition, and has two individual passes where one predicts the base layer and the other predicts the detail layer of the HDR image. The base and detail layer predictions are then multiplied element-wise to obtain the final HDR image. By providing the ground truth base and detail layers of the HDR image, this second structure can fully concentrate on the decompositions for the final prediction.

The last structure shown in FIG. 5c also uses the guided filter for multiplicative input decomposition, but the feature maps from the individual passes are concatenated for direct prediction of the HDR image. This network learns the optimal integration operation that lowers the final loss through the last three convolutional layers, whereas for the structure in FIG. 5b, we explicitly force the network to model the base and detail layers of the HDR image for an element-wise multiplicative integration. Note that this last structure is the same as the pre-train structure in FIG. 4.

TABLE 3
Structure           (a)*      (a)       (a)       (b)                (c)*               (c)
Layer               Number of filter channels (input, output); entries separated by "/" are the two passes
1                   3, 32     3, 32     3, 45     3, 32 / 3, 32      3, 32 / 3, 32      3, 32 / 3, 32
2                   32, 32    32, 32    45, 45    32, 32 / 32, 32    32, 32 / 32, 32    32, 32 / 32, 32
3                   32, 32    32, 32    45, 48    32, 32 / 32, 32    32, 32 / 32, 32    32, 32 / 32, 32
4                   32, 32    32, 32    48, 45    32, 32 / 32, 32    32, 52             64, 40
5                   32, 32    32, 32    45, 45    32, 32 / 32, 32    52, 48             40, 40
6                   32, 3     32, 3     45, 3     32, 3 / 32, 3      48, 3              40, 3
Total parameters    38,592    38,592    77,760    77,184             77,328             77,112
PSNR of Y (dB)      45.46     46.36     46.36     46.84              46.73              47.03
PSNR of YUV (dB)    47.28     48.25     48.39     48.65              48.87              48.82

The results of the experiment are given in Table 3, where (a), (b) and (c) denote the structures illustrated in FIG. 5. For fair comparison, we tune the number of filters in the hidden layers so that the total number of parameters for each of the structures is similar. We perform two additional experiments with structures (a) and (c), denoted by (a)* and (c)*, where (a)* is the structure (a) without residual learning, and (c)* is the structure (c) with element-wise multiplication instead of concatenation for integrating the feature maps after the third convolution layer. We compare the structures in terms of PSNR measured only for the Y channel and for all three YUV channels.
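The parameter totals in Table 3 can be checked by counting 3×3 filter weights layer by layer. The short sketch below does this under the assumption that biases are not counted; the helper names are hypothetical, and the channel widths are those listed in the table. The last line applies the same count to three decomposition layers with 32, 32 and 6 output channels, which gives the 11,808 additional filter parameters quoted for the fully convolutional network in Section 4.3.

```python
def conv_params(channels, k=3):
    """Number of weights in a stack of k x k convolutions (biases excluded)."""
    return sum(k * k * cin * cout for cin, cout in channels)

def chain(widths):
    """Turn [c0, c1, c2, ...] into [(c0, c1), (c1, c2), ...]."""
    return list(zip(widths, widths[1:]))

plain_32 = conv_params(chain([3, 32, 32, 32, 32, 32, 3]))      # (a)* and (a), 32 filters
plain_45 = conv_params(chain([3, 45, 45, 48, 45, 45, 3]))      # (a), wider variant
two_pass_b = 2 * plain_32                                       # (b), two full passes
branches = 2 * conv_params(chain([3, 32, 32, 32]))              # first three layers of (c)*, (c)
c_star = branches + conv_params(chain([32, 52, 48, 3]))         # (c)*, multiplied features
c_concat = branches + conv_params(chain([64, 40, 40, 3]))       # (c), concatenated features
decomposition = conv_params(chain([3, 32, 32, 6]))              # LDR decomposition layers

print(plain_32, plain_45, two_pass_b, c_star, c_concat, decomposition)
# 38592 77760 77184 77328 77112 11808
```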

Even with a similar number of parameters, there is a maximum PSNR difference of 0.67 dB measured for the Y channel only and 0.48 dB measured for the YUV channels, depending on whether input decomposition is used and how the decomposed inputs are treated. The highest performing structure is the structure (c), where the network has the freedom to learn the integration of the two feature extraction passes, although (c)* shows comparable results for PSNR measured for YUV. Comparing the structures (a) and (b), we confirm that the input decomposition using the guided filter is highly beneficial. Also, for the simple CNN architecture in the structure (a), it is crucial to use residual learning for improved prediction. Letting the convolution layers focus on specific input decompositions with different characteristics, and learn to combine information generated by the different branches, is important in reconstructing a high quality HDR image.

4.3 Effect of Pre-Training

We model a fully end-to-end CNN structure as illustrated in FIG. 2 by replacing the guided filter based separation in the structure (c) with three convolutional layers each with 32 filters of size 3×3. However, we find that pre-training the network is essential to fully utilize the three parts of the network (LDR decomposition, feature extraction and HDR reconstruction) as intended. Table 4 shows the results of the same network, the fully convolutional network in FIG. 2, with different training procedures. The specified order n corresponding to each of the three parts in Table 4 indicates the training order, where only the layers of a specific part are trained in the nth order, and the weights of other layers remain fixed. When the feature extraction and HDR reconstruction parts are marked as being trained first (denoted as ‘1st’ in columns 3, 4 and 5 of Table 4), the LDR decomposition part was replaced with the guided filter separation for the pre-training. The ‘All’ in row 5 of Table 4 indicates that all layers were trained end-to-end.

If the whole network is simply trained end-to-end without any pre-training, it achieves 0.32 dB lower performance in PSNR of Y than the structure of FIG. 5c, even though it has three more convolution layers or 11,808 more filter parameters. However, if the pre-trained filter values (of FIG. 4 or FIG. 5c) using the guided filter separation (instead of the LDR decomposition layers) are used to initialize the network before the end-to-end training, the resulting PSNR performance increases by 0.4 dB and 0.18 dB for Y and YUV respectively, thus exceeding the best performing version in Table 3.

The highest PSNR performance can be obtained by also pre-training the LDR decomposition layers in conjunction with the pre-training by the guided filter separation network and the end-to-end training at the end.

TABLE 4
Network Part          Order of Training
LDR Decomposition     -        2nd      -        2nd
Feature Extraction    -        1st      1st      1st
HDR Reconstruction    -        1st      1st      1st
All                   1st      -        2nd      3rd
PSNR of Y (dB)        46.71    46.28    47.11    47.27
PSNR of YUV (dB)      48.80    48.15    48.98    49.21

TABLE 5
Sequence    Banterle et al. [17]    Huo et al. [19]    Kovaleski et al. [16]    Masia et al. [24]    Meylan et al. [21]    Rempel et al. [22]    Ours (ITM-CNN)
Aquarium    20.78                   28.57              23.24                    23.96                23.59                 24.80                 32.24
Leaves      19.53                   29.47              23.48                    19.64                23.05                 25.13                 29.95
Cuisine     17.47                   28.73              21.51                    22.24                22.01                 24.79                 31.07
Average     19.26                   28.92              22.74                    21.95                22.88                 24.91                 31.09

4.4 Comparisons with Conventional Methods

Since no previous method was explicitly trained for viewing with consumer HDR displays, fair comparison with our method is difficult. When comparing the previous ITM methods, we set the maximum brightness to 1,000 cd/m², remove gamma correction and apply the expansion operator in the linear space. After the expansion, the color container is converted from BT.709 to BT.2020 and the PQ-EOTF transfer function is applied for pixel values to be in logarithmic space. Note that our method works directly in the logarithmic space without any conversion using the transfer function or color container.
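The conversion steps applied to the previous methods can be sketched as a short pipeline. Several details are assumptions made for illustration: the gamma is modeled as a pure power law with exponent 2.2, the expansion operator is replaced by a plain linear scaling to the peak brightness (each compared method supplies its own operator at that step), the output is quantized to full-range 10-bit codes, and the function names are hypothetical. The color matrix is the linear-light BT.709 to BT.2020 conversion from ITU-R BT.2087, and the PQ constants are those of SMPTE ST 2084; encoding uses the inverse of the PQ EOTF.

```python
import numpy as np

# Linear-light RGB conversion from BT.709 to BT.2020 primaries (ITU-R BT.2087).
BT709_TO_BT2020 = np.array([
    [0.6274, 0.3293, 0.0433],
    [0.0691, 0.9195, 0.0114],
    [0.0164, 0.0880, 0.8956],
])

# SMPTE ST 2084 (PQ) constants.
M1, M2 = 2610 / 16384, 2523 / 4096 * 128
C1, C2, C3 = 3424 / 4096, 2413 / 4096 * 32, 2392 / 4096 * 32

def pq_encode(nits):
    """Map absolute luminance in cd/m^2 (0 to 10,000) to a PQ signal in [0, 1]."""
    y = np.clip(np.asarray(nits, dtype=np.float64) / 10000.0, 0.0, 1.0) ** M1
    return ((C1 + C2 * y) / (1.0 + C3 * y)) ** M2

def expand_linear(linear, peak_nits=1000.0):
    """Placeholder expansion operator: plain scaling to the display peak."""
    return linear * peak_nits

def ldr_to_pq_10bit(ldr_8bit, gamma=2.2, peak_nits=1000.0):
    """8-bit gamma-coded BT.709 RGB (H, W, 3) to 10-bit PQ-coded BT.2020 RGB."""
    linear = (ldr_8bit.astype(np.float64) / 255.0) ** gamma     # remove gamma
    nits = expand_linear(linear, peak_nits)                      # expand in linear space
    nits_2020 = np.clip(nits @ BT709_TO_BT2020.T, 0.0, None)     # change color container
    return np.round(pq_encode(nits_2020) * 1023.0).astype(np.uint16)
```

Under these assumptions a white LDR pixel lands near three quarters of the PQ code range, since PQ reserves its upper codes for luminance between 1,000 and 10,000 cd/m².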

Another complication is tone mapping for viewing on paper or SDR displays. All images in this paper were tone mapped using the madVR renderer with the MPC-HC player, heuristically found to be most similar to viewing with HDR consumer displays. Although this is not the exact application of our problem, the result images still support the validity of our approach, and show that the existing methods are not directly applicable for viewing on HDR consumer displays. The result images for subjective comparisons are given in FIG. 6. PSNR performance comparison is given in Table 5. More experimental results are provided as supplementary material attached to this paper.

5. CONCLUSION

Although high-end HDR TVs are readily available in the market, HDR video contents are scarce. This entails the need for a means to up-convert LDR legacy video to HDR video for the high-end HDR TVs. Although existing ITM methods share a similar goal of up-converting, their ultimate objective is not to render the inversely-tone-mapped HDR images on consumer HDR TV displays, but to transfer the LDR scenes to the HDR domain for better graphics rendering on professional HDR monitors. The resulting HDR images from previous ITM methods exhibit noise amplification in dark regions and lack of local contrast or unnatural colors when being viewed on consumer HDR TV displays.

In this paper, we first present the ITM problem using CNNs for HDR consumer displays, where the network is trained to restore lost details and local contrast. For an accurate prediction of the HDR image, the different parts (LDR decomposition, feature extraction and HDR reconstruction) of our network must be trained separately prior to end-to-end training. Specifically, we adopt the guided filter for LDR decomposition for the pre-training stage so that the later layers can focus on the individual decompositions with separate passes. The HDR reconstruction part of ITM-CNN learns to integrate the feature maps from the two passes. The resulting HDR images are artifact-free, restore local contrast and details, and are closest to the ground truth HDR images when compared to previous ITM methods. ITM-CNN is readily applicable to legacy LDR videos for their direct viewing as HDR videos on consumer HDR TV displays.

The units and/or modules described herein may be implemented using hardware components and software components. For example, the hardware components may include microphones, amplifiers, band-pass filters, audio to digital convertors, and processing devices. A processing device may be implemented using one or more hardware devices configured to carry out and/or execute program code by performing arithmetical, logical, and input/output operations. The processing device(s) may include a processor, a controller and an arithmetic logic unit, a digital signal processor, a microcomputer, a field programmable array, a programmable logic unit, a microprocessor or any other device capable of responding to and executing instructions in a defined manner. The processing device may run an operating system (OS) and one or more software applications that run on the OS. The processing device also may access, store, manipulate, process, and create data in response to execution of the software. For purpose of simplicity, the description of a processing device is used as singular; however, one skilled in the art will appreciate that a processing device may include multiple processing elements and multiple types of processing elements. For example, a processing device may include multiple processors or a processor and a controller. In addition, different processing configurations are possible, such as parallel processors.

The software may include a computer program, a piece of code, an instruction, or some combination thereof, to independently or collectively instruct and/or configure the processing device to operate as desired, thereby transforming the processing device into a special purpose processor. Software and data may be embodied permanently or temporarily in any type of machine, component, physical or virtual equipment, computer storage medium or device, or in a propagated signal wave capable of providing instructions or data to or being interpreted by the processing device. The software also may be distributed over network coupled computer systems so that the software is stored and executed in a distributed fashion. The software and data may be stored by one or more non-transitory computer readable recording mediums.

The methods according to the above-described example embodiments may be recorded in non-transitory computer-readable media including program instructions to implement various operations of the above-described example embodiments. The media may also include, alone or in combination with the program instructions, data files, data structures, and the like. The program instructions recorded on the media may be those specially designed and constructed for the purposes of example embodiments, or they may be of the kind well-known and available to those having skill in the computer software arts. Examples of non-transitory computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM discs, DVDs, and/or Blu-ray discs; magneto-optical media such as optical discs; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory (ROM), random access memory (RAM), flash memory (e.g., USB flash drives, memory cards, memory sticks, etc.), and the like. Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter. The above-described devices may be configured to act as one or more software modules in order to perform the operations of the above-described example embodiments, or vice versa.

A number of example embodiments have been described above. Nevertheless, it should be understood that various modifications may be made to these example embodiments. For example, suitable results may be achieved if the described techniques are performed in a different order and/or if components in a described system, architecture, device, or circuit are combined in a different manner and/or replaced or supplemented by other components or their equivalents. Accordingly, other implementations are within the scope of the following claims.

1. An apparatus for inverse tone mapping comprising: an end-to-end CNN that is able to jointly optimize the LDR decomposition and HDR reconstruction phases, wherein the end-to-end CNN allows all legacy LDR images/video to be up-converted for direct viewing on HDR TVs.